Point of View: Missing Data in Phylogenetic Analysis: Reconciling Results from Simulations and Empirical Data

Authors

  • John J. Wiens
  • Matthew C. Morrill
  • Karl Kjer
Abstract

This paper will attempt to resolve some controversies about the effects of missing data on phylogenetic analysis. Whether missing data are generally problematic is a critical issue in modern phylogenetics, especially as wildly different amounts of molecular data become available for different taxa, ranging from entire genomes, to single genes, to none (e.g., fossils). Our perception of the impact of missing data (or lack thereof) may strongly influence which taxa and characters we include in a phylogenetic analysis (Wiens 2006) and may lead to a diversity of serious errors. For example, if we think that missing data are problematic when they are not, then we may exclude taxa and characters that would otherwise benefit our analyses, given the abundant evidence that increasing numbers of both taxa and characters can potentially improve the accuracy of phylogenetic analyses (e.g., Huelsenbeck 1995; Rannala et al. 1998; Poe 2003), where accuracy is generally defined as the similarity between the estimated tree and the correct, known phylogeny. In contrast, if missing data cells are themselves intrinsically problematic (e.g., Huelsenbeck 1991), including taxa or characters with many missing data cells may lead to inaccurate phylogenetic estimates. Several studies have explored how missing data may impact phylogenetic analyses, using both empirical and simulated data. Many simulation and empirical studies now suggest that it is often possible to include taxa that have large amounts of missing data without ill effects (e.g., Wiens 2003b; Driskell et al. 2004; Philippe et al. 2004; Wiens et al. 2005; Wiens and Moen 2008; Lynch and Wagner 2010; Thomson and Shaffer 2010; Wiens, Kuczynski, Townsend, et al. 2010). However, a recent simulation study (Lemmon et al. 2009) suggested instead that missing (“ambiguous”) data are generally problematic for phylogenetic analysis and implied that these previous simulation and empirical studies are therefore incorrect. 
They justified their study based on the grounds that previous studies were supposedly in conflict about the impacts of missing data (p. 131). In this paper, we will show that the paper by Lemmon et al. (2009; LEA hereafter) is problematic for several reasons. First, despite their statement that previous studies are in conflict, most simulation and empirical results on missing data can be easily explained within an existing theoretical framework (Wiens 2003b). Furthermore, many contradictory studies suggesting that missing data are not generally problematic for Bayesian and likelihood analyses (given some assumptions) were not addressed by LEA. Second, the sweeping negative conclusions of LEA are not necessarily supported by their results. LEA find missing data to be problematic primarily when using sets of invariant or saturated characters and/or when obvious rate heterogeneity is ignored. Their results do not support the idea that missing data generally lead to incorrect inferences about topology when informative data are analyzed with appropriate methods. We conduct new simulations under more realistic conditions, and these results show no evidence that missing data generally lead to inaccurate Bayesian estimates of phylogeny. In fact, we show that the practice of excluding characters simply because they contain missing data cells may itself reduce accuracy. We reanalyze the “manipulated” empirical example from LEA and find that, without these artificial “manipulations” of the data, their conclusions are not supported. We also analyze eight empirical data sets, each containing many taxa with extensive missing data. We show that these incomplete taxa are consistently placed into the expected higher taxa, often with very strong support. 
Overall, our results confirm previous simulation and empirical studies showing that taxa with extensive missing data can be accurately placed in phylogenetic analyses and that adding characters with missing data can be beneficial (at least under some conditions). We conclude by pointing out important areas for future research on the topic of missing data and phylogenetic analysis.

Similar resources

Missing data and the accuracy of Bayesian phylogenetics

The effect of missing data on phylogenetic methods is a potentially important issue in our attempts to reconstruct the Tree of Life. If missing data are truly problematic, then it may be unwise to include species in an analysis that lack data for some characters (incomplete taxa) or to include characters that lack data for some species. Given the difficulty of obtaining data from all characters...

Missing data, incomplete taxa, and phylogenetic accuracy.

The problem of missing data is often considered to be the most important obstacle in reconstructing the phylogeny of fossil taxa and in combining data from diverse characters and taxa for phylogenetic analysis. Empirical and theoretical studies show that including highly incomplete taxa can lead to multiple equally parsimonious trees, poorly resolved consensus trees, and decreased phylogenetic ...

Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants.

Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci recovered between more distantly related ...

Study the Life Skills of 11-19 year old Children affected by Thalassemia referring to Educational and Remedial Centers in Rasht city from their Mothers’ Point of View 2009-2010

Introduction: Children who are affected by chronic diseases such as thalassemia have more mental and social problems in compare with healthy people. Adopting to such conditions needs awareness of the ways to overcome these problems. Gaining life skill together with knowledge and science and appropriate change of attitudes, values and reinforcement of appropriate behaviors lead to normal behavio...

Publication date: 2011